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Abstract 

Digestor is a software system which automatically re-authors arbitrary documents from the 
World-Wide Web to display appropriately on small screen devices such as PDAs and cellular 
phones, providing device-independent access to the Web. Digestor is implemented as an HTTP 
proxy which dynamically re-authors requested Web pages using a heuristic planning algorithm 
and a set of structural page transformations to achieve the best looking document for a given 
display size. 

1. Introduction 



Access to World-Wide Web documents from personal electronic 
devices has been demonstrated in research projects [2, 1 0, 1 7, 1 8], 
and is now becoming a commercial reality. General Magic's 
Presto! Links for Sony's MagicLink, AllPen's NetHopper for the 
Newton and Sharp's MI- 10 (Figure 1, shown at right), all provide 
WWW browsers for PDA class devices, while the Nokia 9000 
Communicator and Samsung's Duett provide Web access capabilities 
from cellular phones. 




Figure 1. 
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Figure 2. 



Unfortunately, most pages on the World-Wide Web are 
designed for display on desktop computers with color 
monitors having at least 640x480 resolution, and many 
pages are designed with even larger monitors in mind. This 
disconnect can lead to a 4-to-l (or greater) ratio of 
designed vs. available screen area, making direct 
presentation of most WWW pages on these small devices 
aesthetically unpleasant, un-navigable, and, in the worst 
case, completely illegible. This presents the central problem 
addressed by this work; how to display arbitrary Web pages 
which have been designed for desktop systems on personal 
electronic devices which have much more limited I/O 
capabilities. 



2. Approaches to Device-independent Access 

There are four general approaches to displaying WWW pages on small screen devices: device-specific 
authoring; multiple-device authoring; client-side navigation; and automatic re-authoring. 

Device-specific authoring involves authoring a set of WWW pages for a particular display device, for example a 
cellular phone outfitted with a display and communications software such as the Nokia 9000. The basic 
philosophy in this approach is that users of such specialty devices will only have access to a select set of 
services, and the pages for these services can all be designed up-front for the device's particular display. 
Information may be provided from the world-wide web at large, but the desired pages must be pre-defined and 
custom information extraction and page formatting software must be written to deliver the information to the 
small device. This is the approach taken in Unwired Planet's UP.Link service [1 6] which uses a proprietary 
mark-up language (HDML). 

In multiple-device authoring, a range of target devices is identified, and mappings from a single source document 
to a set of rendered documents are defined to cover the devices within the range. One example of this is the 
StretchText approach [8], in which portions of the document (potentially down to the word level) can be tagged 
with a 'level of abstraction' measure. Upon receiving the document, users can specify the level of abstraction 
they wish to view and are presented with the corresponding detail or lack thereof. Another example of 
multiple-device authoring is HTML cascading style sheets (CSS) [13]. In CSS, a single style sheet defines a set 
of display attributes for difFerent structural portions of a document (e.g., all top-level section headings are to be 
displayed in red 18-point Times font). A series of style sheets may be attached to a document, each with a 
weight describing its desirability to the document's author. The user can also specify a style sheet, as can the 
WWW browser (the 'default' style sheet). Although the author's style sheets normally override the user's, the 
user can selectively enable or disable the author's, providing the ability to tailor the rendering of the document to 
their particular display. 

In client-side navigation, the user is given the ability to interactively navigate a single web page by altering the 
portion of it that is displayed at any given time. A very trivial example of this is the use of scroll bars on the 
document display area. A much more sophisticated approach is that taken in the PAD-H- system [3], in which 
the user is free to zoom and pan the device display over the document with infinite resolution. Active Outlining 
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[1 1] has also been implemented as a client-side navigation technique, in which the user can dynamically expand 
and collapse sections of the document under their respective section headings. Other techniques which fall into 
this category include semi-transparent widgets [12] and the Magic Lens system [4]. 

Finally, automatic re-authoring involves developing software which can take an arbitrary web document 
designed for the desktop, along with characteristics of the target display device, and re-author the document 
through a series of transformations so that it can be appropriately displayed on the device. This process can be 
performed either on the client, on the server, or on an intermediary HTTP proxy server (as in [7]) which exists 
solely for the purpose of providing these transformation services. An example of this latter approach is the UC 
Berkeley Pythia proxy [9], which performs transformations on web page images, although the focus of this work 
is on minimizing page retrieval time, not on producing the most appropriate page layout for the display device. 

There are many possible automatic re-authoring techniques, which can be categorized along two dimensions: 
syntactic vs. semantic and transformation vs. elision. Syntactic techniques operate on the structure of the page, 
while semantic techniques rely on some understanding of the content. Elision techniques basically remove some 
information, leaving everything else untouched, while transformation techniques involve modifying some aspect 
of the page's presentation or content. Table 1 illustrates these dimensions, along with examples of each 
category. 



Table 1 
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Image Reduction | 
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Removing 'Irrelevant Content 
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Each of the four approaches-device-specific authoring, multiple-device authoring, client-side navigation, and 
automatic re-authoring-has benefits and drawbacks. Device-specific authoring will typically yield the 
best-looking results, but limits the user's access to a small select set of web pages. Multiple-device authoring, 
while less total effort per document than device-specific authoring, still requires significantly more manual design 
work than simply authoring for a single desktop platform. Client-side navigation holds promise if a good set of 
techniques can be developed, but the 'peephole' approach taken in PAD++ seems very awkward to use for 
large documents, and the active outlining technique has limited applicability since most web pages do not use a 
strict section/sub-section organization. Automatic re-authoring is thus the ideal approach to providing broad 
access to the web from a wide range of devices, if it can be made to produce legible, navigable and aesthetically 
pleasing re-authored documents without loss of information. 

3. Understanding the Re-authoring Process 

In order to gain an understanding of the process required of an automated re-authoring system, a study was 
conducted to assess the characteristics of typical web pages, and to identify candidate re-authoring techniques 
through the process of re-authoring several web pages by hand. 
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A relatively small sample of 'typical' web pages-the Xerox Corporate web site [20]--was initially selected to 
focus the study. This collection of 3,188 web pages is representative of a state-of-the-art, 
professionally-designed site. A variety of statistics were collected on these pages using a web crawler, to help 
gain an understanding of the structure and content of a typical page. These statistics generally agree with other, 
larger-scale studies that have been performed across the entire web [6,1 9]. 

Next, a subset of the pages in the Xerox web site was selected for manual re-authoring. A set of pages from the 
Xerox 1 995 Annual Report were selected and converted by hand for display on a Sharp Zaurus PDA with a 
320x240-pixel screen. Detailed notes were kept of the design strategies and techniques used. 

Some of the design heuristics learned during this process were: 

• Keeping at least some of the original images is important to maintain the look and feel of the original 
document Common techniques include keeping only the first, or only the first and last image (bookend 
images) and eliding the rest. 

• Section headers (HI - H6 tags) are not often used correctly. They are more frequently used to achieve 
a particular font size and style (e.g., bold), if they are used at all. Thus, they cannot be relied upon to 
provide a structural outline for most documents. Instead, documents with many text blocks can be 
reduced by replacing each text block with the first sentence or phrase of each block {first sentence 
elision), 

• An initial rule of thumb for images is to reduce them all in size by a standard percentage, dictated by the 
ratio of the display area that the document was authored for to the display area of the target device. 
Images which contain text or numbers can be reduced by only a small amount before their contents 
become illegible. 

• Semantic elision can be performed on sidebars which present information which is tangential to the main 
concepts presented in a page. Many of the Xerox pages had such sidebars which were simply 
eliminated in the reduced versions. , 

• Semantic elision can also be performed on images which do not contribute any information to the page, 
but serve only to improve its aesthetics. 

• Pages can be categorized, and then re-authored based on their category. Two examples of these are 
banners and link tables. Banners primarily contain a set of images and a small number of navigation 
links (often only one) which serve to establish an aesthetic look, but contain little or no content When 
space is at a premium, these can usually be omitted entirely. Link table pages consist primarily of a set 
of hypertext links to other pages, and very little additional content. These pages can usually be 
re-formatted into a more compact form which just lists the links in a text block. 
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• White space, which is taken for granted on a large display, is at a premium on small devices. Several 
techniques were discovered for reducing the amount of white space in a page. Sequences of paragraphs 
(P tags) or breaks (BR tags) can be collapsed into one. Lists (UL, OL, and DL tags) take up valuable 
horizontal space with their indenting and bullets, and can be re-formatted into simple text blocks with 
breaks between successive items (also observed in [8]). 

In conclusion, to perform document re-authoring two things are required: a set of re-authoring techniques (a 
*bag of tricks 1 ), and a strategy for applying them. Of the techniques used in the manual re-authoring study, those 
most amenable to codification were the syntactic elision techniques (section outlining, first sentence elision, 
image elision) and the syntactic transformation techniques (image size reduction, font size reduction). The design 
strategy learned during the study consisted of a ranking of the transformation techniques (i.e., try this before 
that) and a set of conditions under which each transformation or combination of transformations should be 
applied. 

4. Digestor System Design 

Following the results of the study discussed above, there are two major elements to Digester's design: a 
collection of individual re-authoring techniques which transform documents in various ways; and an automated 
re-authoring system which implements a design strategy by selecting the best combination of techniques for a 
given document/display size pair. 

4.1. Re-Authoring Techniques 

4.1.1 Outlining 




Figure 3. 



Section header outlining techniques provide a very good method for reducing the required display size for 
structured documents, such as technical papers and reports. The outlining process is depicted in Figure 3 at 
right. The contents of each section is elided from the document and the section header is converted into a 
hypertext link which, when selected, loads the elided content into the browser. When confronted with multiple 
section levels (sections, subsections, sub-subsections, etc.), there are two approaches to performing the elision. 
The first-full outlining-vsorks by keeping only the section headers and eliding all content, with the results 
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looking like a table of contents for a book. In the second approach~/o-/eve/ outlining-a cutoff level in the 
section hierarchy is determined and all content below that level (including lower-level section headers) is elided, 
but all content above that level is kept. 

4.1.2 First Sentence Elision 

Since most pages have text blocks, even when no section headers are present, first sentence elision can be a 
good way of reducing required screen area. In this technique, each text block is replaced with its first sentence 
(or phrase up to some natural break point), and this sentence is also made into a hypertext link to the original 
text block. 

4.1.3 Image Reduction and Elision 

Images present one of the most difficult problems for automatic re-authoring, because the decision of whether to 
keep, reduce, or eliminate a given image should be based on an understanding of the content and role of the 
image on the page. However, image reduction and elision can be applied without content understanding, as long 
as users are provided a mechanism by which they can retrieve the original image. The approach taken in 
Digestor is to provide a set of techniques which transform all images in a page by pre-defined scaling factors 
(25%, 50%, and 75%), and making the reduced images hypertext links back to the originals. In addition to 
image reduction, three syntactic elision techniques were also developed for images-elide all, first image only, 
and bookends-in which all images, all but the first, and all but the first and last are elided from the document, 
respectively. Elided images are replaced with their ALT text when available, or with a standard icon when no 
ALT text is available, which is also made into a hypertext link to the original image. 

4.2. Automated Re-authoring System 

The overall process of deciding which combination of techniques to 
apply to a given page for a given client display seems at first to require 
some form of human artistic ability. However, an automatic 

re-authoring algorithm has been developed which captures many of 1 
the heuristics used in the manual re-authoring exercise, and which does 
a fairly good job of producing good looking pages for a given display. 



The basic approach is that of a heuristic planner [1] which searches a 
document transformation space in a best-first manner. Each state in 
this space represents a version of the document, with the initial state 
representing the original 'as-authored' document. The measure of merit 
(evaluation function) for each state is a rough estimate of the screen 
area required for the document as it exists in that state. A state can be 
expanded into a successor state through the application of single 
transformation technique. At every step in the search process the most 
promising state (that with the smallest display area requirements) is 
selected and a transformation applied to it, if possible. As soon as a 
state is created containing a document version that is 'good enough, 
the search is halted and that document is returned to the client for 





Figure 4. 
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rendering. If the search is exhausted and no document version can be 
found that is good enough, then the best document found during the 
search is returned. 

Heuristic information is used in several places in the planner, including: the order in which transformation 
techniques are applied to a given state; the pre-conditions for each transformation technique; and the 
determination of when a document version is f good enough' to halt the search. In general, transformations which 
make minor changes to the document are preferred over those which make more extensive changes (e.g., 
reducing images by 25% is preferable to reducing them by 75%). The pre-conditions for each transformation 
technique specify which other transformations they can be combined with (e.g., it makes no sense to apply both 
full outlining and first sentence elision to the same document), as well as requirements on the content and 
structure of the document that the technique is being applied to (e.g., only apply full outlining when there are at 
least three section headers). The current condition for 'good enough 1 is fairly simplistic; the search is stopped 
when the area required by a document version is 2.5 times the screen area of the client display (which assumes 
that the user doesn't mind scrolling the display a little in one direction). 

5. Implementation 

The design described above has been implemented as an HTTP proxy server (as in [7]). The proxy accepts a 
request for an HTML document, retrieves the document from the specified HTTP server, parses the HTML 
and constructs an Abstract Syntax Tree (AST), labels each of the AST nodes with a unique identifier, and then 
retrieves any embedded images so that their size can be determined (as necessary). Once this has been 
accomplished, the planner is initialized with a state containing the AST for the original document During each 
planning cycle, the planner selects the state with the best document version so far, then selects the best 
applicable transformation technique and applies it resulting in a new state and document version being 
generated. It is assumed that the convolution of transformations is always commutative, and several checks are 
used in the planner to ensure that redundant states are not constructed. 

Fifteen transformation techniques have been implemented and integrated into the system: FullOutline, 
OutlineToHl, OutlineToH2, OutlineToH3, OutlineToH4, OutlineToH5, OutlineToH6, FiistSentenceElision, 
Reducelmages25%, Reducelmages50%, Reducelmages75%, ElideAllImages, FirstlmageOnly, 
Bookendlmages, and ReduceFontSize. Transformations manipulate the AST in the state they are applied to in 
order to produce a new version of the document (the manipulations are similar to those described in [5]). 
Whenever portions of the AST are elided or transformed, an HTML hypertext link is added into the AST 
referencing the node identifiers of all affected AST subtrees, enabling users to request the original portions of the 
document that have been modified during re-authoring. 

The system is implemented in the Java programming language, and the HTTP proxy server software was based 
on the MBServler system [14]. In addition to functioning as a true proxy, this system can also respond to 
requests for certain URLs with documents generated by the proxy itself, which is used to provide the user with 
forms-based control over the proxy and the re-authoring process. In the worst case the planner produces 80 
versions of the document in its search space and takes about 20 seconds to complete on a 75Mhz Pentium 
(using Symantec's Java JIT compiler). 
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The first thing that a user of Digestor will typically do is specify the size of 
display for their device, and indicate the size of their default browser font 
(this is needed in order to estimate the screen area requirements of text 
blocks). To do this, the user requests a specific control URL from the 
proxy, resulting in delivery of the form shown in Figure 5. 

Once users hae configured the system, they can start retrieving documents 
from the web. The pages below in Figure 6 illustrate Digester's 
re-authoring capability. In this example, the system chose to use 25% 
image reduction in combination with first sentence elision to render the 
displayed version. Immediately following retrieval of a page, the user can 
request a trace of the re-authoring session to determine which 
transformations had been applied by requesting another control URL from 
the proxy. 



••mm*; ht***>ho -tir. * *sat.v«a i j- a: wfl/kuas. r * 

ka.J tfoi ufVa •Jualv^iHM.HAl ■&•» aaB»v„ 




r^»i. fy~>i m.u .is* > r ^w-jupfc'Jf.*. J w*aw .^-o. fmwt v. 
teat ■»>a*^i»* k i': .wnni*ik:inN »»l BUnrfaii 

M*4UJ| M ldfll d«i iJlllaf 4a?0 X MKWII 













at 




its 











'ptfSSuctivity 



Figure 6. 




" K^arrnduclfl 



Pud .l.ASuiiv 

\ liLcaxii wi.' CIla.C2:.-.\ li>v DGa.ii 



|»-/iir«nL r»i»«»ft3 w.*»*ll«cv«r ' /#ff*i'L" lift •'•ffi-h 



fttttW 



6. Conclusion 

Digestor does a good job of automatically re-authoring web pages for display on devices with small screens. It 
has been informally tested on a wide range of pages for a number of screen sizes, and it produces output which 
is always legible and navigable, although not always as aesthetically pleasing as one would like. 

6.1. Future Work 

There is much work to be done on Digestor before it could be made into a widely usable product. Some of the 
general issues which need to be addressed include: 

• Improved Measure of Merit - The current system simply adds up the space requirements of all 
images and text to arrive at an estimate of the screen area requirements for a document This is 
adequate for fairly dense documents with minimal structure, such as those in the Xerox Annual Report, 
but works poorly for documents with a lot of white space or which use advanced layout techniques 
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(e.g., tables). Ultimately, what is needed is a size estimator which performs much of the work performed 
by a browser in formatting each document version onto a display area. Factors other than required 
screen area may also need to be included, such as actual width requirements (users don't like to scroll 
horizontally), bandwidth requirements, and aesthetic measures. 

• More User Control - Users should be able to adjust the various heuristics used in the planner to suit 
their taste. For example, they could specify the relative preference of the transformation techniques, or 
specify that some transforms not be used at all. At a higher level of abstraction, they could express their 
preferences for a set of trade-offs, such as 'more content' vs. larger representation 1 . In addition, the 
re-authoring system could be moved to the client and coupled with the browser so that the user could 
dynamically apply and undo different transformations until they achieved a result they liked. 

• Shared Proxy - As implemented, Digestor can only be configured for one user at a time. Since it 
potentially has greater value as a shared public or workgroup service, it could be equipped to handle 
configuration information from a range of users-using a technique such as Netscape cookies [1 5] -to 
convey each user's preferences. 

• More Transformation Techniques -- The area in which Digestor could use the most improvement is 
its set of transformation techniques. Some specific candidates to add include: elision of advanced page 
elements (e.g., tables, applets, Shockwave plug-ins, etc.); folding of table rows or columns; 
re-formatting of lists into text blocks; folding of descriptive lists (as in [8]); compressing whitespace 
(e.g., removing multiple P or BR tags); and text block summarization. A thorough study of graphic art 
techniques could be made to derive additional candidates. Transformations could also be generalized by 
the use of user-specified parameters so that, for example, a transformation could elide all text blocks 
containing a user-specified keyword. 

Acknowledgments 

Thanks to Joe Sullivan for supporting this research. 
References 

1 . A. Barr and E. Feigenbaum. The Handbook of Artificial Intelligence (William Kaufinann, Inc., Los 
Altos, California, 1 98 1 , pp. 58-63). 

2. J. Bartlett. Experience with a Wireless World Wide Web Client. IEEE COMPCON 95 (San j 
Francisco, March 1 995). 

3. B. Bederson and J. Hollan. Pad++: A Zooming Graphical Interface for Exploring Alternate Interface 
Physics. Proceedings of ACM UIST '94 (ACM Press, 1994). 

4. E. Bier, M. Stone, K. Pier, W. Buxton, T. DeRose. Toolglass and Magic Lenses: The See-through 
Interface. SIGGRAPH '93 Conference Proceedings (1993). 

5. S. Bonhomme and C. Roisin. Interactively Restructuring HTML Documents. Fifth International World 
Wide Web Conference (Paris, May 1996). 

6. T. Bray. Measuring the Web. Fifth International World Wide Web Conference (Paris, May 1996). 

7. C. Brooks, M. Mazer, S. Meeks, J. Miller. Application-Specific Proxy Servers as HTTP Stream 
Transducers. Fourth International World Wide Web Conference (Boston, December 1995). 

8. I. Cooper and R. Shufflebotham. PDA Web Browsers: Implementation Issues (University of Kent at 
Canterbury Computing Laboratory WWW Page, November 1995). 

9. A. Fox and E. Brewer. Reducing WWW Latency and Bandwidth Requirements by Real-Time 
Distillation. Fifth International World Wide Web Conference (Paris, May 1996). 



Page 9 



'Digestor: Device-Independent Access to the World Wide Web 

10. S. Gessler and A. Kotulla. PDAs as Mobile WWW Browsers. Second International World Wide Web 
Conference (Chicago, October 1994). 

11. J. Hsu, W. Johnston, and J. McCarthy. Active Outlining for HTML Documents: An X-Mosaic 
Implementation. Second International World Wide Web Conference (Chicago, October 1 994). 

12. T. Kamba, S. Elson, T. Harpold, T. Stamper, and P. Sukaviriya. Using small screen space more 
efficiently. CHI 96 (Vancouver, BC Canada, April 1996). 

13. H. Lie and Bert Bos. Cascading Style Sheets. WWW Consortium, (September 1 996). 

14. Mort Bay Consulting. MBServler: Mort Bay's HTTP Servlet Serving Server. 

15. Netscape Communications Corporation. Persistent Client State HTTP Cookies. 

16. Unwired Planet, Inc. UP.Link Developer's Guide Version 1.0. Redwood Shores, California, July 1996. 

1 7. G. Voelker and B. Bershad. Mobisaic: An Information System for a Mobile Wireless Computing 
Environment. Workship on Mobile Computing Systems and Applications (Santa Cruz, December 
1994). 

18. T. Watson. Application Design for Wireless Computing. 1994 Mobile Computing Systems and 
Applications Workshop Position Paper (August 1 994). 

1 9. A. Woodruff, P. Aoki, E. Brewer, P. Gauthier, and L. Rowe. An Investigation of Documents from the 
WWW. Fifth International World Wide Web Conference (Paris, May 1996). 

20. Xerox Corporation. 1995 Annual Report. 



Return to Top of Page 

Return to Technical Papers Index 



Page 10 



